MDL-based DCG Induction for NP Identification

Author

  • Miles Osborne
Abstract

We introduce a learner capable of automatically extending large, manually written natural language Definite Clause Grammars with missing syntactic rules. It is based upon the Minimum Description Length principle, and can be trained upon either just raw text, or else raw text additionally annotated with parsed corpora. As a demonstration of the learner, we show how full Noun Phrases (NPs that might contain pre- or post-modifying phrases and might also be recursively nested) can be identified in raw text. Preliminary results obtained by varying the amount of syntactic information in the training set suggest that raw text is less useful than additional NP bracketing information. However, using all syntactic information in the training set does not produce a significant improvement over just bracketing information.

1 Introduction

Identification of Noun Phrases (NPs) in free text has been tackled in a number of ways (for example, [25, 9, 2]). Usually, however, only relatively simple NPs, such as 'base' NPs (NPs that do not contain nested NPs or postmodifying clauses), are recovered. The motivation for this decision seems to be pragmatic, driven in part by a lack of technology capable of parsing large quantities of free text. With the advent of broad coverage grammars (for example [15]) and attendant efficient parsers [11], however, we need not make this restriction: we can now identify 'full' NPs, NPs that might contain pre- and/or post-modifying complements, in free text. Full NPs are more interesting than base NPs to estimate:

  • They are (at least) context free, unlike base NPs, which are finite state. They can contain pre- and post-modifying phrases, so proper identification can in the worst case imply full-scale parsing/grammar learning.
  • Recursive nesting of NPs means that each nominal head needs to be associated with each NP. Base NPs simply group all potential heads together in a flat structure.
As a (partial) response to these challenges, we identify full NPs by treating the task as a special case of full-scale sentential Definite Clause Grammar (DCG) learning. Our approach is based upon the Minimum Description Length (MDL) principle. Here, we do not explain MDL, but instead refer the reader to the literature (for example, see [26, 27, 29, 12, 22]). Although a DCG learning approach to NP identification is far more computationally demanding than any other NP learning technique reported, it does provide a useful test-bed for exploring some of the (syntactic) factors involved …
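The MDL principle underlying the approach can be stated compactly: among candidate grammars, prefer the one minimizing the combined cost of encoding the grammar itself and encoding the training data given that grammar. The sketch below illustrates this trade-off; the bit costs and probabilities are invented for illustration and do not come from the paper.

```python
import math

def description_length(grammar_bits, data_probability):
    """Total MDL cost: bits to encode the grammar plus bits to encode
    the training data under that grammar (negative log2 likelihood)."""
    return grammar_bits + (-math.log2(data_probability))

# Hypothetical candidates: a compact grammar that fits the data poorly
# versus a larger grammar (say, with extra NP rules) that fits it better.
compact = description_length(grammar_bits=120.0, data_probability=2 ** -200)
extended = description_length(grammar_bits=180.0, data_probability=2 ** -100)

best = min([("compact", compact), ("extended", extended)], key=lambda t: t[1])
print(best[0])  # prints "extended": the better fit outweighs the larger grammar
```

In this toy setting the extended grammar wins (280 bits versus 320), which mirrors the intuition behind MDL-based rule induction: a new rule is accepted only when the data-compression it buys exceeds the cost of encoding the rule.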


Similar Articles

DCG Induction using MDL and Parsed

We show how partial models of natural language syntax (manually written DCGs, with parameters estimated from a parsed corpus) can be automatically extended when trained upon raw text (using MDL). We also show how we can use a parsed corpus as an alternative constraint upon estimation. Empirical evaluation suggests that a parsed corpus is more informative than an MDL-based prior. However, best r...


A Simple Transformation for Offline-Parsable Grammars and its Termination Properties

We present, in easily reproducible terms, a simple transformation for offline-parsable grammars which results in a provably terminating parsing program directly top-down interpretable in Prolog. The transformation consists of two steps: (1) removal of empty productions, followed by (2) left-recursion elimination. It is related both to left-corner parsing (where the grammar is compiled, rather tha...


Complexity of Model Checking for Modal Dependence Logic

Modal dependence logic (MDL) was introduced recently by Väänänen. It enhances the basic modal language by an operator =(·). For propositional variables p1, …, pn the atomic formula =(p1, …, pn−1, pn) intuitively states that the value of pn is determined solely by those of p1, …, pn−1. We show that model checking for MDL formulae over Kripke structures is NP-complete and further co...


Properties of Bayesian Belief Network Learning Algorithms

In this paper the behavior of various belief network learning algorithms is studied. Selecting belief networks with certain minimality properties turns out to be NP-hard, which justifies the use of search heuristics. Search heuristics based on the Bayesian measure of Cooper and Herskovits and a minimum description length (MDL) measure are compared with respect to their properties for...




Journal:

Volume   Issue 

Pages  -

Publication date: 1999